{
 "cells": [
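  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pima Indians Diabetes Data Set: Data Exploration\n",
    "\n",
    "Data description:\n",
    "The Pima Indians Diabetes Data Set is used to predict, from available medical information, the probability that a Pima Indian patient develops diabetes within 5 years.\n",
    "\n",
    "The data set has 9 fields:\n",
    "\n",
    "- Column 0: number of pregnancies\n",
    "- Column 1: plasma glucose concentration at 2 hours in an oral glucose tolerance test\n",
    "- Column 2: diastolic blood pressure (mm Hg)\n",
    "- Column 3: triceps skin fold thickness (mm)\n",
    "- Column 4: 2-hour serum insulin (mu U/ml)\n",
    "- Column 5: body mass index (weight in kg / (height in m)^2)\n",
    "- Column 6: diabetes pedigree function\n",
    "- Column 7: age (years)\n",
    "- Column 8: class variable (0 or 1)\n",
    "\n",
    "Data link: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes"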
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the required packages for reading the file and processing the features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the data file from its path and preview the first rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   pregnants  Plasma_glucose_concentration  blood_pressure  \\\n0          6                           148              72   \n1          1                            85              66   \n2          8                           183              64   \n3          1                            89              66   \n4          0                           137              40   \n\n   Triceps_skin_fold_thickness  serum_insulin   BMI  \\\n0                           35              0  33.6   \n1                           29              0  26.6   \n2                            0              0  23.3   \n3                           23             94  28.1   \n4                           35            168  43.1   \n\n   Diabetes_pedigree_function  Age  Target  \n0                       0.627   50       1  \n1                       0.351   31       0  \n2                       0.672   32       1  \n3                       0.167   21       0  \n4                       2.288   33       1  \n"
     ]
    }
   ],
   "source": [
    "# input data\n",
    "train = pd.read_csv(\"data/pima_indians_diabetes/pima-indians-diabetes.csv\")\n",
    "print(train.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 768 entries, 0 to 767\nData columns (total 9 columns):\npregnants                       768 non-null int64\nPlasma_glucose_concentration    768 non-null int64\nblood_pressure                  768 non-null int64\nTriceps_skin_fold_thickness     768 non-null int64\nserum_insulin                   768 non-null int64\nBMI                             768 non-null float64\nDiabetes_pedigree_function      768 non-null float64\nAge                             768 non-null int64\nTarget                          768 non-null int64\ndtypes: float64(2), int64(7)\nmemory usage: 54.0 KB\nNone\n"
     ]
    }
   ],
   "source": [
    "print(train.info())"
   ]
  },
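  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At first glance the data set has no missing values.\n",
    "However, this data set is known to contain missing values, which are encoded as 0 in some columns. The definitions of these measurements and basic domain knowledge confirm this: for example, a value of 0 for body mass index or blood pressure is not a meaningful measurement."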
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        pregnants  Plasma_glucose_concentration  blood_pressure  \\\ncount  768.000000                    768.000000      768.000000   \nmean     3.845052                    120.894531       69.105469   \nstd      3.369578                     31.972618       19.355807   \nmin      0.000000                      0.000000        0.000000   \n25%      1.000000                     99.000000       62.000000   \n50%      3.000000                    117.000000       72.000000   \n75%      6.000000                    140.250000       80.000000   \nmax     17.000000                    199.000000      122.000000   \n\n       Triceps_skin_fold_thickness  serum_insulin         BMI  \\\ncount                   768.000000     768.000000  768.000000   \nmean                     20.536458      79.799479   31.992578   \nstd                      15.952218     115.244002    7.884160   \nmin                       0.000000       0.000000    0.000000   \n25%                       0.000000       0.000000   27.300000   \n50%                      23.000000      30.500000   32.000000   \n75%                      32.000000     127.250000   36.600000   \nmax                      99.000000     846.000000   67.100000   \n\n       Diabetes_pedigree_function         Age      Target  \ncount                  768.000000  768.000000  768.000000  \nmean                     0.471876   33.240885    0.348958  \nstd                      0.331329   11.760232    0.476951  \nmin                      0.078000   21.000000    0.000000  \n25%                      0.243750   24.000000    0.000000  \n50%                      0.372500   29.000000    0.000000  \n75%                      0.626250   41.000000    1.000000  \nmax                      2.420000   81.000000    1.000000  \n"
     ]
    }
   ],
   "source": [
    "# Summary statistics of the numeric features\n",
    "print(train.describe())"
   ]
  },
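  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results show that many columns have a minimum value of 0. For some of these variables a value of 0 is not meaningful, which indicates that the value is invalid or missing.\n",
    "\n",
    "Specifically, a value of 0 is meaningless for the following variables:\n",
    "\n",
    "1. Plasma glucose concentration\n",
    "2. Diastolic blood pressure\n",
    "3. Triceps skin fold thickness\n",
    "4. 2-hour serum insulin\n",
    "5. Body mass index\n",
    "\n",
    "In a pandas DataFrame, the replace() function makes it easy to mark the values in the columns of interest as NaN.\n",
    "\n",
    "Once the missing values are marked, isnull() flags every NaN in the data set as True, and summing the result gives the number of missing values in each column."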
   ]
  },
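  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The zero-to-NaN marking described above can be sketched on a small toy frame. This is only a minimal illustration: the frame `demo`, its values, and the list `zero_as_missing` are hypothetical and do not come from the data set. It shows that legitimate zeros, such as a pregnancy count of 0, stay untouched because only the listed columns are replaced.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy data with hypothetical values, not the real data set\n",
    "demo = pd.DataFrame({\n",
    "    'pregnants': [0, 2, 0],\n",
    "    'BMI': [33.6, 0.0, 26.6],\n",
    "    'blood_pressure': [72, 0, 0],\n",
    "})\n",
    "\n",
    "# Only columns where 0 is physically meaningless are treated as missing\n",
    "zero_as_missing = ['BMI', 'blood_pressure']\n",
    "demo[zero_as_missing] = demo[zero_as_missing].replace(0, np.nan)\n",
    "\n",
    "# isnull() yields True for each NaN; summing counts them per column\n",
    "missing_counts = demo.isnull().sum()\n",
    "print(missing_counts)\n",
    "```\n",
    "\n",
    "In this toy frame the zeros in `pregnants` are kept as valid values, while the zeros in `BMI` and `blood_pressure` are counted as missing."
   ]
  },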
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pregnants                         0\n",
      "Plasma_glucose_concentration      5\n",
      "blood_pressure                   35\n",
      "Triceps_skin_fold_thickness     227\n",
      "serum_insulin                   374\n",
      "BMI                              11\n",
      "Diabetes_pedigree_function        0\n",
      "Age                               0\n",
      "Target                            0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "NaN_col_names = ['Plasma_glucose_concentration',\n",
    "                 'blood_pressure',\n",
    "                 'Triceps_skin_fold_thickness',\n",
    "                 'serum_insulin',\n",
    "                 'BMI']\n",
    "train[NaN_col_names] = train[NaN_col_names].replace(0, np.NaN)\n",
    "print(train.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### For features with many missing values, add an indicator feature marking whether the value is missing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   Triceps_skin_fold_thickness  Triceps_skin_fold_thickness_Missing\n0                           35                                    0\n1                           29                                    0\n2                            0                                    0\n3                           23                                    0\n4                           35                                    0\n5                            0                                    0\n6                           32                                    0\n7                            0                                    0\n8                           45                                    0\n9                            0                                    0\n"
     ]
    }
   ],
   "source": [
    "# Many values are missing, so add a new field indicating whether each value is missing\n",
    "train['Triceps_skin_fold_thickness_Missing'] = train['Triceps_skin_fold_thickness']\\\n",
    "    .apply(lambda x: 1 if pd.isnull(x) else 0)\n",
    "print(train[['Triceps_skin_fold_thickness', 'Triceps_skin_fold_thickness_Missing']].head(10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0xbeeb1b0>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEHCAYAAABBW1qbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAYH0lEQVR4nO3de7BcZZnv8e9DQkyQQEgIDGQDQUBKIkPAGPCCo4RzQFQIKhQcPXIVrWK8DCPKTFmCEarUM0eQwUGjKMELl4MKgRFQQMzoDIYQAoRQHILOIZtbQoAMTMhg4nP+6HcvmqR30kl67d47+X6qdvVa73p79dPdSf/6XWv1WpGZSJIEsE23C5AkDR6GgiSpYihIkiqGgiSpYihIkirDu13A5th5551z4sSJ3S5DkoaUe++999nMHN9q2ZAOhYkTJzJv3rxulyFJQ0pE/L/+lrn5SJJUMRQkSRVDQZJUGdL7FCSpW/70pz/R29vLqlWrul1Kv0aOHElPTw/bbrtt2/cxFCRpE/T29jJ69GgmTpxIRHS7nHVkJsuXL6e3t5e999677fu5+UiSNsGqVasYN27coAwEgIhg3LhxGz2SqTUUIuLfI+LBiFgQEfNK29iI+FVEPFpudyrtERGXRsTiiHggIg6pszZJ2lyDNRD6bEp9AzFSeE9mTs7MKWX+POCOzNwPuKPMA7wX2K/8nQVcPgC1SZKadGOfwnHAu8v0LOAu4Aul/apsXODh7ogYExG7ZeZTXahRkjbK8uXLmTZtGgBPP/00w4YNY/z4xo+G586dy4gRIzr+mPPnz2fp0qUcffTRHVtn3aGQwC8jIoHvZOZMYNe+D/rMfCoidil9JwBLmu7bW9peEwoRcRaNkQR77rlnzeW35y3nXtXtEjTI3Pu/PtbtEjTAxo0bx4IFCwC44IIL2H777fnc5z7X9v3XrFnDsGHDNuox58+fz8KFCzsaCnVvPnpHZh5CY9PQ2RHxrvX0bbXxa53LwmXmzMyckplT+lJYkgazD3zgA7zlLW9h0qRJfO973wNg9erVjBkzhi9+8YtMnTqVuXPnMnv2bPbff38OP/xwPvWpTzF9+nQAXnrpJU499VSmTp3KwQcfzE033cTLL7/MjBkz+PGPf8zkyZO5/vrrO1JrrSOFzHyy3C6NiJ8DU4Fn+jYLRcRuwNLSvRfYo+nuPcCTddYnSQNh1qxZjB07lpUrVzJlyhQ+9KEPMXr0aFasWMEhhxzChRdeyMqVK3njG9/I7373O/bcc09OPPHE6v4zZszg6KOP5sorr+T555/n0EMP5YEHHuBLX/oSCxcu5JJLLulYrbWNFCLi9RExum8a+O/AQmA2cErpdgpwY5meDXysHIV0GLDC/QmStgQXX3wxBx10EG9729vo7e3lscceA2DEiBEcf/zxACxatIj999+fvfbai4jg5JNPru7/y1/+kosuuojJkyfznve8h1WrVvH444/XUmudI4VdgZ+XQ6KGAz/JzFsj4h7guog4A3gcOKH0/wVwDLAYWAmcVmNtkjQgbr/9dubMmcPdd9/NqFGjeOc731n9dmDUqFHVYaONY2xay0xuuOEG9tlnn9e0z5kzp+P11jZSyMw/ZOZB5W9SZl5U2pdn5rTM3K/cPlfaMzPPzsx9MvPAzPSc2JKGvBUrVjB27FhGjRrFQw89xD333NOy36RJk3jkkUdYsmQJmcm1115bLTvqqKO49NJLq/n77rsPgNGjR/Piiy92tF5/0SxJNXrf+97HypUrOeigg5gxYwaHHnpoy37bbbcdl112GUceeSSHH344u+++OzvuuCMA559/PitXruTAAw9k0qRJXHDBBQAcccQR3H///Rx88MFDY0ezJG2N+j60oXFSuttuu61lvxdeeOE180ceeSSPPPIImcknPvEJpkxp/Ob39a9/Pd/97nfXuf/48eM7fqExRwqSNEhcfvnlTJ48
mQMOOICXX36Zj3/84wNegyMFSRokzj33XM4999yu1uBIQZJUMRQkSRVDQZJUMRQkSRV3NEtSB3T6bMntnGn31ltv5TOf+Qxr1qzhzDPP5LzzztvgfTbEkYIkDUFr1qzh7LPP5pZbbmHRokVcffXVLFq0aLPXayhI0hA0d+5c9t13X97whjcwYsQITjrpJG688cYN33EDDAVJGoKeeOIJ9tjj1asN9PT08MQTT2z2eg0FSRqCWp1Vte+Mq5vDUJCkIainp4clS169gnFvby+77777Zq/XUJCkIeitb30rjz76KH/84x955ZVXuOaaazj22GM3e70ekipJHdDOIaSdNHz4cC677DKOOuoo1qxZw+mnn86kSZM2f70dqE2S1AXHHHMMxxxzTEfX6eYjSVLFUJAkVQwFSVLFUJAkVQwFSVLFUJAkVTwkVZI64PEZB3Z0fXt+6cEN9jn99NO5+eab2WWXXVi4cGFHHteRgiQNUaeeeiq33nprR9dpKEjSEPWud72LsWPHdnSdhoIkqWIoSJIqhoIkqWIoSJIqHpIqSR3QziGknXbyySdz11138eyzz9LT08OXv/xlzjjjjM1ap6EgSUPU1Vdf3fF11r75KCKGRcR9EXFzmd87In4fEY9GxLURMaK0v67MLy7LJ9ZdmyTptQZin8JngIeb5r8GXJyZ+wHPA31jnTOA5zNzX+Di0k+SNIBqDYWI6AHeB3yvzAdwBHB96TILmF6mjyvzlOXTSn9JGpQys9slrNem1Ff3SOES4PPAn8v8OOCFzFxd5nuBCWV6ArAEoCxfUfq/RkScFRHzImLesmXL6qxdkvo1cuRIli9fPmiDITNZvnw5I0eO3Kj71bajOSLeDyzNzHsj4t19zS26ZhvLXm3InAnMBJgyZcrgfDckbfF6enro7e1lMH85HTlyJD09PRt1nzqPPnoHcGxEHAOMBHagMXIYExHDy2igB3iy9O8F9gB6I2I4sCPwXI31SdIm23bbbdl77727XUbH1bb5KDP/LjN7MnMicBJwZ2Z+BPg18OHS7RTgxjI9u8xTlt+Zg3VcJklbqG78ovkLwDkRsZjGPoMrSvsVwLjSfg5wXhdqk6St2oD8eC0z7wLuKtN/AKa26LMKOGEg6pEktea5jyRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklSpLRQiYmREzI2I+yPioYj4cmnfOyJ+HxGPRsS1ETGitL+uzC8uyyfWVZskqbU6Rwr/BRyRmQcBk4GjI+Iw4GvAxZm5H/A8cEbpfwbwfGbuC1xc+kmSBlBtoZANL5XZbctfAkcA15f2WcD0Mn1cmacsnxYRUVd9kqR11bpPISKGRcQCYCnwK+Ax4IXMXF269AITyvQEYAlAWb4CGFdnfZKk16o1FDJzTWZOBnqAqcCbWnUrt61GBbl2Q0ScFRHzImLesmXLOlesJGlgjj7KzBeAu4DDgDERMbws6gGeLNO9wB4AZfmOwHMt1jUzM6dk5pTx48fXXbokbVXqPPpofESMKdOjgCOBh4FfAx8u3U4BbizTs8s8ZfmdmbnOSEGSVJ/hG+6yyXYDZkXEMBrhc11m3hwRi4BrIuJC4D7gitL/CuCHEbGYxgjhpBprkyS10FYoRMQdmTltQ23NMvMB4OAW7X+gsX9h7fZVwAnt1CNJqsd6QyEiRgLbATtHxE68ujN4B2D3mmuTJA2wDY0UPgF8lkYA3MurofAfwLdqrEuS1AXrDYXM/CbwzYj4VGb+4wDVJEnqkrb2KWTmP0bE24GJzffJzKtqqkuS1AXt7mj+IbAPsABYU5oTMBQkaQvS7iGpU4AD/N2AJG3Z2v3x2kLgL+os
RJLUfe2OFHYGFkXEXBqnxAYgM4+tpSpJUle0GwoX1FmEtKV5fMaB3S5Bg9CeX3qw2yVsULtHH/2m7kIkSd3X7tFHL/LqaaxH0Lhgzn9m5g51FSZJGnjtjhRGN89HxHRanL9IkjS0bdKpszPzBhqX1ZQkbUHa3Xz0wabZbWj8bsHfLEjSFqbdo48+0DS9Gvh34LiOVyNJ6qp29ymcVnchkqTua2ufQkT0RMTPI2JpRDwTET+NiJ66i5MkDax2dzT/gMY1lHcHJgA3lTZJ0hak3VAYn5k/yMzV5e9KYHyNdUmSuqDdUHg2Ij4aEcPK30eB5XUWJkkaeO2GwunAicDTwFPAhwF3PkvSFqbdQ1K/ApySmc8DRMRY4B9ohIUkaQvR7kjhL/sCASAznwMOrqckSVK3tBsK20TETn0zZaTQ7ihDkjREtPvB/r+Bf42I62mc3uJE4KLaqpIkdUW7v2i+KiLm0TgJXgAfzMxFtVYmSRpwbW8CKiFgEEjSFmyTTp0tSdoyGQqSpIqhIEmqGAqSpIqhIEmqGAqSpEptoRARe0TEryPi4Yh4KCI+U9rHRsSvIuLRcrtTaY+IuDQiFkfEAxFxSF21SZJaq3OksBr428x8E3AYcHZEHACcB9yRmfsBd5R5gPcC+5W/s4DLa6xNktRCbaGQmU9l5vwy/SLwMI2rth0HzCrdZgHTy/RxwFXZcDcwJiJ2q6s+SdK6BmSfQkRMpHFW1d8Du2bmU9AIDmCX0m0CsKTpbr2lbe11nRUR8yJi3rJly+osW5K2OrWHQkRsD/wU+Gxm/sf6urZoy3UaMmdm5pTMnDJ+vFcElaROqjUUImJbGoHw48z8WWl+pm+zULldWtp7gT2a7t4DPFlnfZKk16rz6KMArgAezsxvNC2aDZxSpk8Bbmxq/1g5CukwYEXfZiZJ0sCo80I57wD+J/BgRCwobX8PfBW4LiLOAB4HTijLfgEcAywGVuI1oCVpwNUWCpn5W1rvJwCY1qJ/AmfXVY8kacP8RbMkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqhoIkqWIoSJIqtYVCRHw/IpZGxMKmtrER8auIeLTc7lTaIyIujYjFEfFARBxSV12SpP7VOVK4Ejh6rbbzgDsycz/gjjIP8F5gv/J3FnB5jXVJkvpRWyhk5hzgubWajwNmlelZwPSm9quy4W5gTETsVldtkqTWBnqfwq6Z+RRAud2ltE8AljT16y1t64iIsyJiXkTMW7ZsWa3FStLWZrDsaI4WbdmqY2bOzMwpmTll/PjxNZclSVuXgQ6FZ/o2C5XbpaW9F9ijqV8P8OQA1yZJW72BDoXZwCll+hTgxqb2j5WjkA4DVvRtZpIkDZzhda04Iq4G3g3sHBG9wPnAV4HrIuIM4HHghNL9F8AxwGJgJXBaXXVJkvpXWyhk5sn9LJrWom8CZ9dViySpPYNlR7MkaRAwFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQxFCRJFUNBklQZVKEQEUdHxCMRsTgizut2PZK0tRk0oRARw4BvAe8FDgBOjogDuluVJG1dBk0oAFOBxZn5h8x8BbgGOK7LNUnSVmV4twtoMgFY0jTfCxy6dqeIOAs4q8y+FBGPDEBt0kbZC3YGnu12HRpkzo9uV9Bnr/4WDKZQaPVq5ToNmTOBmfWXI226iJiXmVO6XYe0sQbT5qNeYI+m+R7g
yS7VIklbpcEUCvcA+0XE3hExAjgJmN3lmiRpqzJoNh9l5uqI+GvgNmAY8P3MfKjLZUmbyk2cGpIic53N9pKkrdRg2nwkSeoyQ0GSVDEUpA7zdC0aytynIHVQOV3L/wX+G43DrO8BTs7MRV0tTGqTIwWpszxdi4Y0Q0HqrFana5nQpVqkjWYoSJ3V1ulapMHKUJA6y9O1aEgzFKTO8nQtGtIGzWkupC2Bp2vRUOchqZKkipuPJEkVQ0GSVDEUJEkVQ0GSVDEUJEkVQ0GSVDEUtjIRMS4iFpS/pyPiiab5EWv1vS0iRner1rVFxG8jYnKL9k2qMyIOiIj7I+K+iJjYT5/hEfFCP8t+FBHT17P+cyJiZBvrOTsiPrKe9RwZETes77kMhIg4MyIyIv6qqe2E0ja9zP8gIvbfyPUeHxHndrpebRp/vLaVyczlwGSAiLgAeCkz/6G5T0QEjd+wHDXwFW68zajzg8D1mfmVTtbT5Bzg+8Cq9XXKzG/V9Ph1eBA4GfhNmT8JuL9vYWaetrErzMyfd6Y0dYIjBQEQEftGxMKI+DYwH9gtInojYkxZflpEPFC+Wf+gtO0aET+LiHkRMTciDivtF0bErIj4dUQ8GhGnl/YJ5dv+gvJYb++nluER8cOIeLD0+/Ray4eVb+kXlPneiBjT9ByuiIiHIuKWvm/qLR7jWOCvgU9GxO2l7fPl/gsj4lMt7rNNRPxTRCyKiJuAndfzev4NsAvwL33rL+1fLa/hv0XELk2v12fL9Bsj4s7SZ/7aI5iIOLSvvdzvioj4TUT8ISLObup3SnlPFpSat+nvdY2IvynP6f6I+FF/z6m4C3h7WdcOwJ7AwqbH/W1ETN6YxyojkEvK9I8i4psR8a/lOR1f2odFxLfL+3pTRNwa6xmladM5UlCzA4DTMvOTAI0BA0TEQcAXgLdn5nMRMbb0vxT4embeXT68bgbeXJYdCLwd2AGYHxH/DHwUuCkzvxaNi9GM6qeOtwA7Z+aB5fHHNC0bDvwEmJ+ZX2tx3/1pXNTmwYj4GTCdxjUNXiMzZ0fEVODZzLykTH+ExvUQhgFzI+I3QPPFcT4M7F2e4+5l2bdbPYHMvDgi/hY4PDNfiIjhwI7AbzLzvIj4BnA68NW17no1cEFm3lQCbRtg3/I6HA5cDBybmb3l/XkjMA0YAzwcjVB/E3A8jfdrdUTMpPGN/rF+XtfPA3tl5itrvdat/JlGMBwJ7ArcUB5vbf29h+081i7AO2j8G7oO+DlwAo1TkB8I/AXwMP289to8jhTU7LHMvKdF+xHAtZn5HEDfLY0Phm9HxAIaHw47RUTfB/0NmbkqM5cCc4C30jhZ3JkRcT7w5sx8qZ86FgP7l2+MRwErmpZdQf+BAI0L3DxYpu8FJm7gOfc5HPhpZq7MzBfL83nnWn3eBVydmX/OzF4aH44b4+XMvKW/2iJiJxofpDcBlNdvZVn8ZuCfgPeXx+5zc2a+Ul7n54DxNN6XtwLzynvzV8A+9P+6PgT8KBr7Nf7UxvO4hkbInESLwC0257FuyIYHePVaFO8Eriuv/ZO8uvlKHWYoqNl/9tMetL4mQABTM3Ny+ZuQmS+XZWv3z8y8E3g38BTw4+hn52rZ7/GXwG+BTwPfaVr8O2BaRLyun1r/q2l6De2PhltdB6FleW32a+WVpun+autv/U+W+6+9o73V8w0aJ+Lre1/2z8yvrOd1PYrGt+6pNIJk2Aaex78BhwA7ZOZjrTps5mM1P6dY61Y1MxTUjtuBk/o2GzVtProdaN6O3fyBNT0iXhcRO9P4Fj4vIvYCns7MmcCVwMGtHiwixtPY0f1/gPNpfAD1mVke95qySaZT5gDHR8SoiNiexiU0/6VFn5PK9vkJNL6Br8+LQNtHRWXm88CzEfEBgIgYGRHblcXPAe8Hvl42I63P7cCJ5bXvO+Jsz1ava/lQ7imBfS6NkcZ2/a241JnA3wF/31+fTj1Wk98CH46G3WiM2lQD9ylogzLz
gYj4OjAnIlbT2PRxBo1AuDwiTqPxb+nXvBoS9wC30LjgzPmZ+Uw0djifExF/Al6isY+hlT2AK6Kx0Txp7M9orufrEXERcGVEfKxDz3FuRFxd6ga4vOyXaP4/cj3wHho7Vh+hERLrMxO4PSKWAEe3WcpHgO+U5/cK8KGmGp+Kxg7yX6zveZe6v1weexsam2k+SWMksfbrOhz4STQO6d0G+FrZfLZemfnPG+jS6j1s+Vh9+6424DoamzH7Xvvf89rNiuoQT52tjouICyk7cLtdi7YcEbF9Zr5URiG/Bw7NzGXdrmtL40hB0lBxSzkMdlsao08DoQaOFNRVETGPdb+c/I/MXNSq/yY+xreBw9Zq/kZmXtWh9c+mcbx+s89l5u2t+g92EXEmjd9wNJuTmZ9u1V9bFkNBklTx6CNJUsVQkCRVDAVJUsVQkCRV/j88dwwDSDFfVAAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "%matplotlib inline\n",
    "sns.countplot(x=\"Triceps_skin_fold_thickness_Missing\", hue=\"Target\", data=train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0xcf7f350>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEHCAYAAABBW1qbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAViklEQVR4nO3df7RV5X3n8fc3/AioRATRES6KP4irEgPqDZpJzCRqKyGNkEStrKZiJCHtOC7Tjk6cTpYSqjPJalqNsbVjNBU7qT/GNIKOQQ0JcTWtIioi4hjRZOT6C0QkWiQK+c4fZ9/NAQ5ygLPvuRfer7XOOns/+9n7fM/Rxec+e+/znMhMJEkCeE+7C5Ak9R6GgiSpZChIkkqGgiSpZChIkkr9213A7jjwwANzzJgx7S5DkvqURx555NXMHNFoW58OhTFjxrB48eJ2lyFJfUpE/L/tbfP0kSSpZChIkkqGgiSp1KevKUhSu7zzzjt0dXWxYcOGdpeyXYMGDaKjo4MBAwY0vY+hIEm7oKuriyFDhjBmzBgiot3lbCMzWbNmDV1dXRx++OFN7+fpI0naBRs2bGD48OG9MhAAIoLhw4fv9Eim0lCIiF9FxBMRsSQiFhdtwyLi/oh4png+oGiPiLgmIlZExNKIOL7K2iRpd/XWQOi2K/X1xEjhE5k5ITM7i/VLgQWZORZYUKwDfBIYWzxmAtf1QG2SpDrtuKYwBfh4sTwHWAh8tWi/OWs/8PBgRAyNiEMy86U21ChJO2XNmjWceuqpALz88sv069ePESNqXxpetGgRAwcObPlrPvroo6xatYpJkya17JhVh0IC90VEAv8zM68HDu7+hz4zX4qIg4q+o4CVdft2FW1bhEJEzKQ2kuDQQw+tuPzmnHDJze0uQb3MI395brtLUA8bPnw4S5YsAWDWrFnst99+XHzxxU3vv2nTJvr167dTr/noo4+ybNmyloZC1aePPpKZx1M7NXRBRHzsXfo2Ovm1zc/CZeb1mdmZmZ3dKSxJvdmnP/1pTjjhBMaNG8cNN9wAwMaNGxk6dChf+9rXmDhxIosWLWLevHkcffTRnHzyyVx44YVMnToVgDfffJPzzjuPiRMnctxxx3HXXXfx1ltvMXv2bL7//e8zYcIE7rjjjpbUWulIITNfLJ5XRcQPgYnAK92nhSLiEGBV0b0LGF23ewfwYpX1SVJPmDNnDsOGDWP9+vV0dnbyuc99jiFDhrBu3TqOP/54rrjiCtavX8/73/9+fv7zn3PooYdy9tlnl/vPnj2bSZMmcdNNN7F27VpOPPFEli5dymWXXcayZcu4+uqrW1ZrZSOFiNg3IoZ0LwO/BywD5gHTi27TgbnF8jzg3OIupJOAdV5PkLQnuOqqqxg/fjwf/vCH6erq4tlnnwVg4MCBfOYznwFg+fLlHH300Rx22GFEBNOmTSv3v++++7jyyiuZMGECn/jEJ9iwYQPPP/98JbVWOVI4GPhhcUtUf+AfM3N+RDwM3B4RM4DngbOK/vcAk4EVwHrgCxXWJkk94sc//jEPPPAADz74IIMHD+ajH/1o+d2BwYMHl7eN1u6xaSwzufPOOznyyCO3aH/ggQdaXm9lI4XMfC4zxxePcZl5ZdG+JjNPzcyxxfNrRXtm5gWZeWRmHpuZzoktqc9bt24dw4YNY/DgwTz55JM8/PDDDfuNGzeOp59+mpUrV5KZ3HbbbeW2008/nWuuuaZcf+yxxwAYMmQIb7zxRkvr9RvNklShT33qU6xfv57x48cze/ZsTjzxxIb99tlnH6699lpOO+00Tj75ZEaOHMn+++8PwOWXX8769es59thjGTduHLNmzQLglFNO4fHHH+e4447rGxeaJWlv1P2PNtQmpbv33nsb9nv99de3WD/ttNN4+umnyUy+/OUv09lZ+87vvvvuy3e/+91t9h8xYkTLf2jMkYIk9RLXXXcdEyZM4Jhj
juGtt97iS1/6Uo/X4EhBknqJSy65hEsuuaStNThSkCSVDAVJUslQkCSVDAVJUskLzZLUAq2eLbmZmXbnz5/PRRddxKZNm/jiF7/IpZdeusN9dsSRgiT1QZs2beKCCy7gRz/6EcuXL+eWW25h+fLlu31cQ0GS+qBFixZx1FFHccQRRzBw4EDOOecc5s6du+Mdd8BQkKQ+6IUXXmD06M2/NtDR0cELL7yw28c1FCSpD2o0q2r3jKu7w1CQpD6oo6ODlSs3/4JxV1cXI0eO3O3jGgqS1Ad96EMf4plnnuGXv/wlb7/9NrfeeitnnHHGbh/XW1IlqQWauYW0lfr378+1117L6aefzqZNmzj//PMZN27c7h+3BbVJktpg8uTJTJ48uaXH9PSRJKlkKEiSSoaCJKlkKEiSSoaCJKlkKEiSSt6SKkkt8PzsY1t6vEMve2KHfc4//3zuvvtuDjroIJYtW9aS13WkIEl91Hnnncf8+fNbekxDQZL6qI997GMMGzaspcc0FCRJJUNBklQyFCRJJUNBklTyllRJaoFmbiFttWnTprFw4UJeffVVOjo6+PrXv86MGTN265iGgiT1UbfcckvLj1n56aOI6BcRj0XE3cX64RHxUEQ8ExG3RcTAov29xfqKYvuYqmuTJG2pJ64pXAQ8Vbf+TeCqzBwLrAW6xzozgLWZeRRwVdFPktSDKg2FiOgAPgXcUKwHcApwR9FlDjC1WJ5SrFNsP7XoL0m9Uma2u4R3tSv1VT1SuBr4L8Bvi/XhwOuZubFY7wJGFcujgJUAxfZ1Rf8tRMTMiFgcEYtXr15dZe2StF2DBg1izZo1vTYYMpM1a9YwaNCgndqvsgvNEfH7wKrMfCQiPt7d3KBrNrFtc0Pm9cD1AJ2dnb3zv4akPV5HRwddXV305j9OBw0aREdHx07tU+XdRx8BzoiIycAg4H3URg5DI6J/MRroAF4s+ncBo4GuiOgP7A+8VmF9krTLBgwYwOGHH97uMlqustNHmflfM7MjM8cA5wA/ycw/BH4KnFl0mw7MLZbnFesU23+SvXVcJkl7qHZ8o/mrwJ9FxApq1wxuLNpvBIYX7X8GXNqG2iRpr9YjX17LzIXAwmL5OWBigz4bgLN6oh5JUmPOfSRJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqRSZaEQEYMiYlFEPB4RT0bE14v2wyPioYh4JiJui4iBRft7i/UVxfYxVdUmSWqsypHCb4BTMnM8MAGYFBEnAd8ErsrMscBaYEbRfwawNjOPAq4q+kmSelBloZA1bxarA4pHAqcAdxTtc4CpxfKUYp1i+6kREVXVJ0naVqXXFCKiX0QsAVYB9wPPAq9n5saiSxcwqlgeBawEKLavA4ZXWZ8kaUuVhkJmbsrMCUAHMBH4nUbdiudGo4LcuiEiZkbE4ohYvHr16tYVK0nqmbuPMvN1YCFwEjA0IvoXmzqAF4vlLmA0QLF9f+C1Bse6PjM7M7NzxIgRVZcuSXuVKu8+GhERQ4vlwcBpwFPAT4Ezi27TgbnF8rxinWL7TzJzm5GCJKk6/XfcZZcdAsyJiH7Uwuf2zLw7IpYDt0bEFcBjwI1F/xuBf4iIFdRGCOdUWJskqYGmQiEiFmTmqTtqq5eZS4HjGrQ/R+36wtbtG4CzmqlHklSNdw2FiBgE7AMcGBEHsPli8PuAkRXXJknqYTsaKXwZ+Aq1AHiEzaHwa+BvKqxLktQG7xoKmflt4NsRcWFmfqeHapIktUlT1xQy8zsR8e+BMfX7ZObNFdUlSWqDZi80/wNwJLAE2FQ0J2AoSNIepNlbUjuBY/zegCTt2Zr98toy4N9VWYgkqf2a
HSkcCCyPiEXUpsQGIDPPqKQqSVJbNBsKs6osQtrTPD/72HaXoF7o0MueaHcJO9Ts3Uc/q7oQSVL7NXv30RtsnsZ6ILUfzPm3zHxfVYVJknpesyOFIfXrETGVBvMXSZL6tl2aOjsz76T2s5qSpD1Is6ePPlu3+h5q31vwOwuStIdp9u6jT9ctbwR+BUxpeTWSpLZq9prCF6ouRJLUfk1dU4iIjoj4YUSsiohXIuIHEdFRdXGSpJ7V7IXmv6f2G8ojgVHAXUWbJGkP0mwojMjMv8/MjcXjJmBEhXVJktqg2VB4NSI+HxH9isfngTVVFiZJ6nnNhsL5wNnAy8BLwJmAF58laQ/T7C2pfwFMz8y1ABExDPgWtbCQJO0hmh0pfLA7EAAy8zXguGpKkiS1S7Oh8J6IOKB7pRgpNDvKkCT1Ec3+w/5XwL9ExB3Uprc4G7iysqokSW3R7Deab46IxdQmwQvgs5m5vNLKJEk9rulTQEUIGASStAfbpamzJUl7JkNBklQyFCRJJUNBklQyFCRJJUNBklSqLBQiYnRE/DQinoqIJyPioqJ9WETcHxHPFM8HFO0REddExIqIWBoRx1dVmySpsSpHChuB/5yZvwOcBFwQEccAlwILMnMssKBYB/gkMLZ4zASuq7A2SVIDlYVCZr6UmY8Wy28AT1H71bYpwJyi2xxgarE8Bbg5ax4EhkbEIVXVJ0naVo9cU4iIMdRmVX0IODgzX4JacAAHFd1GASvrdusq2rY+1syIWBwRi1evXl1l2ZK016k8FCJiP+AHwFcy89fv1rVBW27TkHl9ZnZmZueIEf4iqCS1UqWhEBEDqAXC9zPzn4rmV7pPCxXPq4r2LmB03e4dwItV1idJ2lKVdx8FcCPwVGb+dd2mecD0Ynk6MLeu/dziLqSTgHXdp5kkST2jyh/K+QjwR8ATEbGkaPtz4BvA7RExA3geOKvYdg8wGVgBrMffgJakHldZKGTmP9P4OgHAqQ36J3BBVfVIknbMbzRLkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpZChIkkqGgiSpVFkoRMT3ImJVRCyraxsWEfdHxDPF8wFFe0TENRGxIiKWRsTxVdUlSdq+KkcKNwGTtmq7FFiQmWOBBcU6wCeBscVjJnBdhXVJkrajslDIzAeA17ZqngLMKZbnAFPr2m/OmgeBoRFxSFW1SZIa6+lrCgdn5ksAxfNBRfsoYGVdv66ibRsRMTMiFkfE4tWrV1darCTtbXrLheZo0JaNOmbm9ZnZmZmdI0aMqLgsSdq79HQovNJ9Wqh4XlW0dwGj6/p1AC/2cG2StNfr6VCYB0wvlqcDc+vazy3uQjoJWNd9mkmS1HP6V3XgiLgF+DhwYER0AZcD3wBuj4gZwPPAWUX3e4DJwApgPfCFquqSJG1fZaGQmdO2s+nUBn0TuKCqWiRJzektF5olSb2AoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqSSoSBJKhkKkqRSrwqFiJgUEU9HxIqIuLTd9UjS3qbXhEJE9AP+BvgkcAwwLSKOaW9VkrR36TWhAEwEVmTmc5n5NnArMKXNNUnSXqV/uwuoMwpYWbfeBZy4daeImAnMLFbfjIine6A2aaccBgcCr7a7DvUyl0e7K+h22PY29KZQaPRp5TYNmdcD11dfjrTrImJxZna2uw5pZ/Wm00ddwOi69Q7gxTbVIkl7
pd4UCg8DYyPi8IgYCJwDzGtzTZK0V+k1p48yc2NE/CfgXqAf8L3MfLLNZUm7ylOc6pMic5vT9pKkvVRvOn0kSWozQ0GSVDIUpBZzuhb1ZV5TkFqomK7lF8DvUrvN+mFgWmYub2thUpMcKUit5XQt6tMMBam1Gk3XMqpNtUg7zVCQWqup6Vqk3spQkFrL6VrUpxkKUms5XYv6tF4zzYW0J3C6FvV13pIqSSp5+kiSVDIUJEklQ0GSVDIUJEklQ0GSVDIUJEklQ0EqRMQfR8S5LT7mTRFxZrF8Q0QcswvHmBURGRFH1bX9adHWWazfExFDd/K4LX+/6vv88pr6nIjon5kbW33czPy7Vh9zq+N/cTd2f4Lat6OvKNbPBMrpuDNz8i7UU+n7Vd/kSEFtExH7RsT/iYjHI2JZRPxBRJwQET+LiEci4t6IOKTouzAi/ntE/Ay4qP4v8GL7m8Xzx4v9b4+IX0TENyLiDyNiUUQ8ERFHvks9syLi4rrX+2ax3y8i4uSifVzRtiQilkbE2IgYExHL6o5zcUTManD8hXV/2b8ZEVcW7/3BiDh4Bx/XnRRTcEfEEcA6YHXdsX8VEQc2+kyL7d+IiOVFzd/aife7T/FZLo2I2yLioe73oD2ToaB2mgS8mJnjM/MDwHzgO8CZmXkC8D3gyrr+QzPzP2TmX+3guOOBi4BjgT8C3p+ZE4EbgAt3or7+xX5fAS4v2v4Y+HZmTgA6qU2Atyv2BR7MzPHAA8CXdtD/18DKiPgAMA24bTv9tvlMI2IY8BlgXGZ+kM2jja01er//EVhb7PcXwAnNvT31VYaC2ukJ4LTiL9STqc0u+gHg/ohYAnyN2iyj3bb3D+HWHs7MlzLzN8CzwH11rzdmJ+r7p+L5kbr9/hX484j4KnBYZr61E8er9zZwd4Pjv5tbqZ1Cmgr8cDt9tvhMM3MdtUDZANwQEZ8F1m9n30bv96PF65KZy4ClTdSpPsxQUNtk5i+o/eX5BPA/gM8BT2bmhOJxbGb+Xt0u/1a3vJHi/9+ICGBg3bbf1C3/tm79t+zcdbTu/TZ175eZ/wicAbwF3BsRp9TXUhjUxLHfyc0Tj5XH34G7qI18ns/MXzfqsPVnGhGXFddfJgI/oBYo87dz/G3eL41/H0J7MENBbRMRI4H1mfm/gG8BJwIjIuLDxfYBETFuO7v/is2nMqYAAyoul6KmI4DnMvMaalNifxB4BTgoIoZHxHuB36/itYtRyVfZ8pTa1vVt/ZkeHxH7Aftn5j3UTg1N2ImX/Wfg7OLYx1A7Jac9mHcfqZ2OBf4yIn4LvAP8CbW/uq+JiP2p/f95NdBo6unvAnMjYhGwgC1HEVX6A+DzEfEO8DIwOzPfiYjZwEPAL4H/W9WLZ+atO+jS6DMdQu2zGkTtL/8/3YmX/FtgTkQsBR6jdvpo3U4Xrj7DqbMlbVdE9AMGZOaG4s6tBdQu3L/d5tJUEUcKkt7NPsBPI2IAtVHGnxgIezZHCtrrRMR/A87aqvl/Z+Z2z9X3hN5al/YuhoIkqeTdR5KkkqEgSSoZCpKkkqEgSSr9f2Ux+dGx1vtFAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Many values are missing, so add a new field indicating whether each value is missing\n",
    "train['serum_insulin_Missing'] = train['serum_insulin'].apply(lambda x: 1 if pd.isnull(x) else 0)\n",
    "sns.countplot(x=\"serum_insulin_Missing\", hue=\"Target\", data=train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, judging from the count plots, whether a feature is missing does not appear to be related to the target."
   ]
  },
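  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to put a number on this visual impression is to compare the share of each target class inside the missing and non-missing groups. The sketch below is a minimal illustration under stated assumptions: the frame `demo` and its values are hypothetical, and only the column names mirror the ones used in this notebook.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy data with hypothetical values, not the real data set\n",
    "demo = pd.DataFrame({\n",
    "    'serum_insulin': [np.nan, 94.0, np.nan, 168.0, np.nan, 88.0, 543.0, np.nan],\n",
    "    'Target': [1, 0, 0, 1, 1, 0, 1, 0],\n",
    "})\n",
    "demo['serum_insulin_Missing'] = demo['serum_insulin'].isnull().astype(int)\n",
    "\n",
    "# Each row is one group (0 = present, 1 = missing); each row sums to 1\n",
    "rates = pd.crosstab(demo['serum_insulin_Missing'], demo['Target'], normalize='index')\n",
    "print(rates)\n",
    "```\n",
    "\n",
    "If the two rows show similar class proportions, as they do in this toy frame, the indicator probably carries little information about the target. Applied to the real data, the proportions may of course differ."
   ]
  },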
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train.drop([\"Triceps_skin_fold_thickness_Missing\", \"serum_insulin_Missing\"], axis=1, inplace=True)"
   ]
  },
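  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The missingness looks random, so the newly added indicator features are dropped and the missing values are simply filled with the column medians."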
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pregnants                       0\nPlasma_glucose_concentration    0\nblood_pressure                  0\nTriceps_skin_fold_thickness     0\nserum_insulin                   0\nBMI                             0\nDiabetes_pedigree_function      0\nAge                             0\nTarget                          0\ndtype: int64\n"
     ]
    }
   ],
   "source": [
    "medians = train.median() \n",
    "train = train.fillna(medians)\n",
    "\n",
    "print(train.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data standardization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
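   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Separate the labels from the features\n",
    "y_train = train['Target']\n",
    "X_train = train.drop([\"Target\"], axis=1)\n",
    "\n",
    "# Keep the feature names so the processed result can be saved with column headers\n",
    "feat_names = X_train.columns\n",
    "\n",
    "# Data standardization\n",
    "# Initialize the feature scaler\n",
    "ss_X = StandardScaler()\n",
    "\n",
    "# Fit the scaler on the features and standardize them (this notebook has no separate test set)\n",
    "X_train = ss_X.fit_transform(X_train)"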
   ]
  },
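  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Standardization rescales each feature to zero mean and unit variance using $z = (x - \\mu) / \\sigma$, where $\\sigma$ is the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows. The sketch below reproduces the formula by hand with NumPy on a toy matrix; `X_demo` and its values are hypothetical.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Toy feature matrix with hypothetical values (rows = samples, columns = features)\n",
    "X_demo = np.array([[1.0, 10.0],\n",
    "                   [2.0, 20.0],\n",
    "                   [3.0, 60.0]])\n",
    "\n",
    "mean = X_demo.mean(axis=0)\n",
    "std = X_demo.std(axis=0)  # NumPy default is ddof=0, the population std\n",
    "z = (X_demo - mean) / std\n",
    "\n",
    "print(z.mean(axis=0))  # close to 0 for every column\n",
    "print(z.std(axis=0))   # close to 1 for every column\n",
    "```\n",
    "\n",
    "Note that pandas `Series.std()` defaults to `ddof=1`, so standardizing with pandas defaults gives slightly different values on small samples."
   ]
  },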
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save the processed features to a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Save as CSV\n",
    "X_train = pd.DataFrame(columns=feat_names, data=X_train)\n",
    "\n",
    "train = pd.concat([X_train, y_train], axis=1)\n",
    "\n",
    "train.to_csv('data/pima_indians_diabetes/FE_pima-indians-diabetes.csv', index=False, header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   pregnants  Plasma_glucose_concentration  blood_pressure  \\\n0   0.639947                      0.848324        0.149641   \n1  -0.844885                     -1.123396       -0.160546   \n2   1.233880                      1.943724       -0.263941   \n3  -0.844885                     -0.998208       -0.160546   \n4  -1.141852                      0.504055       -1.504687   \n\n   Triceps_skin_fold_thickness  serum_insulin       BMI  \\\n0                     0.907270      -0.692891  0.204013   \n1                     0.530902      -0.692891 -0.684422   \n2                    -1.288212      -0.692891 -1.103255   \n3                     0.154533       0.123302 -0.494043   \n4                     0.907270       0.765836  1.409746   \n\n   Diabetes_pedigree_function       Age  Target  \n0                    0.468492  1.425995       1  \n1                   -0.365061 -0.190672       0  \n2                    0.604397 -0.105584       1  \n3                   -0.920763 -1.041549       0  \n4                    5.484909 -0.020496       1  \n"
     ]
    }
   ],
   "source": [
    "print(train.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
